For this project, we’ll be exploring the data on COVID-19 vaccinations gathered by Our World in Data.
The specific data set is “Data on COVID-19 (coronavirus) vaccinations”. For details of the data set, see the Further Information section at the end of this document.
The data is updated daily, so the file should be downloaded if it either does not exist locally or has not been modified for more than 1 day.
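# Packages used throughout: dplyr (data verbs, %>%), tidyr (complete, fill), magrittr (%<>%)
library(dplyr)
library(tidyr)
library(magrittr)
source_location <- 'https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/vaccinations.csv'
local_file_location <- 'Data/covid_Vaccination_data.csv'
result <- download.file(source_location, destfile = local_file_location,
                        method = "wininet", quiet = TRUE)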
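The download call above runs unconditionally. A minimal sketch of the freshness check described earlier might look like the following. The helper `needs_download()` and its `max_age_days` parameter are hypothetical names introduced here for illustration; they are not part of the original script.

```r
# Hypothetical helper: TRUE when the local copy is missing or older than
# max_age_days, based on the file's last-modified time.
needs_download <- function(path, max_age_days = 1) {
  if (!file.exists(path)) {
    return(TRUE)
  }
  age_days <- as.numeric(difftime(Sys.time(), file.mtime(path), units = "days"))
  age_days > max_age_days
}

# Usage sketch (assumes source_location and local_file_location as defined above):
# if (needs_download(local_file_location)) {
#   download.file(source_location, destfile = local_file_location, quiet = TRUE)
# }
```

Wrapping the check in a function keeps the one-day threshold in a single place, so it is easy to change if the update schedule of the source file changes.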
Let’s start by reading in the data from the local CSV file, and having a look at the structure of the data:
vaccination_data <- read.csv(local_file_location)
str(vaccination_data, vec.len = 3)
## 'data.frame': 9818 obs. of 12 variables:
## $ location : chr "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ iso_code : chr "AFG" "AFG" "AFG" ...
## $ date : chr "2021-02-22" "2021-02-23" "2021-02-24" ...
## $ total_vaccinations : int 0 NA NA NA NA NA 8200 NA ...
## $ people_vaccinated : int 0 NA NA NA NA NA 8200 NA ...
## $ people_fully_vaccinated : int NA NA NA NA NA NA NA NA ...
## $ daily_vaccinations_raw : int NA NA NA NA NA NA NA NA ...
## $ daily_vaccinations : int NA 1367 1367 1367 1367 1367 1367 1580 ...
## $ total_vaccinations_per_hundred : num 0 NA NA NA NA NA 0.02 NA ...
## $ people_vaccinated_per_hundred : num 0 NA NA NA NA NA 0.02 NA ...
## $ people_fully_vaccinated_per_hundred: num NA NA NA NA NA NA NA NA ...
## $ daily_vaccinations_per_million : int NA 35 35 35 35 35 35 41 ...
Our World in Data update this csv file once per day, with one observation (row) added each day for each country which has updated vaccination statistics that day.
The first three variables (columns) are fairly straightforward to understand:
location: name of the country or other location to which the results apply
iso_code: a unique identifier for the location. This is based on the ISO 3166 standard and will allow us to link this data to other data sets
date: the date that the information was recorded
Many of the vaccines require more than one dose to be fully effective, so several different variables are being tracked:
total_vaccinations: total number of doses administered. If a person receives one dose of the vaccine, this metric goes up by 1. If they receive a second dose, it goes up by 1 again.
people_vaccinated: total number of people who have received at least one vaccine dose. If a person receives the first dose of a 2-dose vaccine, this metric goes up by 1. If they receive the second dose, the metric stays the same.
people_fully_vaccinated: total number of people who have received all doses prescribed by the vaccination protocol. If a person receives the first dose of a 2-dose vaccine, this metric stays the same. If they receive the second dose, the metric goes up by 1.
There are also separate measures for daily_vaccinations_raw and daily_vaccinations. The raw figure is ‘provided for data checks and transparency’ and Our World in Data recommend using daily_vaccinations instead.
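To make the difference between the three cumulative measures concrete, here is a small worked example on a made-up log of doses. The `doses` data frame and its columns are invented for illustration and assume a 2-dose protocol; they are not taken from the Our World in Data file.

```r
# Made-up dose log: one row per dose administered (2-dose protocol assumed).
doses <- data.frame(
  person      = c("A", "B", "A", "C"),
  dose_number = c(1, 1, 2, 1)
)

# Every dose counts towards total_vaccinations.
total_vaccinations <- nrow(doses)

# Each person counts once, however many doses they have received.
people_vaccinated <- length(unique(doses$person))

# Only people who have received the second (final) dose count here.
people_fully_vaccinated <- sum(doses$dose_number == 2)

c(total_vaccinations      = total_vaccinations,
  people_vaccinated       = people_vaccinated,
  people_fully_vaccinated = people_fully_vaccinated)
```

In this toy log, four doses were given to three people, and only person A has completed the protocol, so the three measures come out as 4, 3 and 1.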
We won’t be using the daily data, so we can remove those columns:
cleaned_vacc_data <- vaccination_data %>%
select(!starts_with('daily'))
str(cleaned_vacc_data, vec.len = 3)
## 'data.frame': 9818 obs. of 9 variables:
## $ location : chr "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ iso_code : chr "AFG" "AFG" "AFG" ...
## $ date : chr "2021-02-22" "2021-02-23" "2021-02-24" ...
## $ total_vaccinations : int 0 NA NA NA NA NA 8200 NA ...
## $ people_vaccinated : int 0 NA NA NA NA NA 8200 NA ...
## $ people_fully_vaccinated : int NA NA NA NA NA NA NA NA ...
## $ total_vaccinations_per_hundred : num 0 NA NA NA NA NA 0.02 NA ...
## $ people_vaccinated_per_hundred : num 0 NA NA NA NA NA 0.02 NA ...
## $ people_fully_vaccinated_per_hundred: num NA NA NA NA NA NA NA NA ...
The date variable is stored in character format, so it should be converted to a Date.
cleaned_vacc_data %<>%
mutate(date = as.Date(date, format= "%Y-%m-%d"))
cleaned_vacc_data
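As a quick check of what the conversion buys us, the sketch below converts a few sample strings (copied from the str() output above) and does date arithmetic on them, which would not work on character values. The variable names are illustrative only.

```r
# Sample date strings in the same format as the date column.
date_strings <- c("2021-02-22", "2021-02-23", "2021-02-24")
dates <- as.Date(date_strings, format = "%Y-%m-%d")

class(dates)           # "Date"
dates[3] - dates[1]    # Time difference of 2 days
range(dates)           # earliest and latest date
```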
Browsing through the data, we can see that different countries started reporting their vaccinations at different times, and some have gaps between records.
Let’s summarise the first and last date when each country recorded vaccination numbers, along with the total number of reports made in that time span:
vacc_report_dates_by_country <-
cleaned_vacc_data %>%
group_by(location) %>%
summarise(first_vacc_record = min(date),
last_vacc_record = max(date),
date_span = difftime(last_vacc_record,
first_vacc_record,
units = 'days') + 1,
vacc_report_count = n_distinct(date))
vacc_report_dates_by_country
It’s pretty easy to see that the countries have different dates for their first and last record, but for most countries the date_span and vacc_report_count are the same, meaning that there is a record for every day during that time. There are exceptions, though, where some days have no record:
vacc_report_dates_by_country %>%
  # date_span counts both end dates, so any excess over the report count means missing days
  filter(as.numeric(date_span) > vacc_report_count)
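The comparison relies on the inclusive span (last date minus first date, plus 1) being larger than the number of distinct report dates. A minimal sketch on made-up data, assuming dplyr is installed, shows the idea; `toy`, `toy_summary` and `toy_gaps` are illustrative names.

```r
library(dplyr)

# Made-up records: "X" reports every day, "Y" skips 2021-01-02.
toy <- data.frame(
  location = c("X", "X", "X", "Y", "Y"),
  date = as.Date(c("2021-01-01", "2021-01-02", "2021-01-03",
                   "2021-01-01", "2021-01-03"))
)

toy_summary <- toy %>%
  group_by(location) %>%
  summarise(date_span    = as.numeric(max(date) - min(date)) + 1,  # days, inclusive
            report_count = n_distinct(date))

# A location has gaps when the inclusive span exceeds the number of reports.
toy_gaps <- toy_summary %>%
  filter(date_span > report_count)
toy_gaps
```

Both locations span three days, but "Y" only has two reports, so it is the only row returned. Note that a single missing day is enough to be flagged.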
We want to know the total figures for each country on any given date, so we’ll need to generate records for the missing dates. We can do this using the complete() function from tidyr:
complete_vacc_data <-
cleaned_vacc_data %>%
complete(location, date = seq.Date(min(date), max(date), by = 'day') )
complete_vacc_data
summary(complete_vacc_data)
## location date iso_code total_vaccinations
## Length:18252 Min. :2020-12-13 Length:18252 Min. : 0
## Class :character 1st Qu.:2021-01-08 Class :character 1st Qu.: 51678
## Mode :character Median :2021-02-04 Mode :character Median : 402115
## Mean :2021-02-04 Mean : 8927709
## 3rd Qu.:2021-03-03 3rd Qu.: 2541835
## Max. :2021-03-30 Max. :577923364
## NA's :12032
## people_vaccinated people_fully_vaccinated total_vaccinations_per_hundred
## Min. : 0 Min. : 1 Min. : 0.000
## 1st Qu.: 47030 1st Qu.: 23380 1st Qu.: 0.728
## Median : 334959 Median : 189671 Median : 3.720
## Mean : 6102907 Mean : 2860705 Mean : 10.048
## 3rd Qu.: 2046726 3rd Qu.: 1008912 3rd Qu.: 11.412
## Max. :327158278 Max. :126779833 Max. :175.270
## NA's :12606 NA's :14295 NA's :12032
## people_vaccinated_per_hundred people_fully_vaccinated_per_hundred
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.640 1st Qu.: 0.320
## Median : 3.050 Median : 1.390
## Mean : 7.424 Mean : 3.562
## 3rd Qu.: 8.610 3rd Qu.: 3.350
## Max. :92.300 Max. :82.970
## NA's :12606 NA's :14295
The summary shows that we now have NA values in all of our numeric columns, and the table of data shows that there are also missing values in the iso_code column.
When we completed the missing dates, the corresponding ISO codes were not filled in. We can add these by building a data frame of each location and its ISO code (from the original data), then joining that to complete_vacc_data:
iso_codes <- vaccination_data %>%
distinct(location, iso_code)
complete_vacc_data %<>% select(-iso_code)
complete_vacc_data <- left_join(complete_vacc_data, iso_codes,
by = 'location')
complete_vacc_data
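To see why the codes went missing and how the join restores them, here is a minimal sketch on made-up data, assuming dplyr and tidyr are installed. The `toy` data frame and its values are invented for illustration.

```r
library(dplyr)
library(tidyr)

# Made-up data: "Y" has no record for 2021-01-02.
toy <- data.frame(
  location = c("X", "X", "Y"),
  iso_code = c("XXX", "XXX", "YYY"),
  date = as.Date(c("2021-01-01", "2021-01-02", "2021-01-01")),
  total_vaccinations = c(10L, 20L, 5L)
)

# complete() adds the missing location/date combination;
# every column not named in the call is NA on the new row.
toy_complete <- toy %>%
  complete(location, date = seq.Date(min(date), max(date), by = "day"))

# Rebuild iso_code from a location-to-code lookup, then join it back on.
toy_codes <- toy %>% distinct(location, iso_code)
toy_fixed <- toy_complete %>%
  select(-iso_code) %>%
  left_join(toy_codes, by = "location")
toy_fixed
```

After the join, the generated row for "Y" on 2021-01-02 has its ISO code back, while total_vaccinations on that row is still NA and is dealt with separately.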
The vaccination figures left in our data are running totals, which means that we would not expect them to decrease. We won’t make any assumptions about how many vaccinations were performed on any days where there is no data, but will use the tidyr fill function to copy totals down from the last day with data:
complete_vacc_data %<>%
  arrange(location, date) %>%
  group_by(location) %>%
  fill(c(total_vaccinations, people_vaccinated, people_fully_vaccinated,
         total_vaccinations_per_hundred, people_vaccinated_per_hundred,
         people_fully_vaccinated_per_hundred)) %>%
  ungroup()  # drop the grouping once the fill is done
complete_vacc_data
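The grouping matters here: without it, totals from one location would be copied down into the first rows of the next. A minimal sketch on made-up running totals, assuming dplyr and tidyr are installed, shows the behaviour; `toy` and `toy_filled` are illustrative names.

```r
library(dplyr)
library(tidyr)

# Made-up running totals with gaps; "Y" has no figure on its first day.
toy <- data.frame(
  location = c("X", "X", "X", "Y", "Y", "Y"),
  date = rep(as.Date(c("2021-01-01", "2021-01-02", "2021-01-03")), times = 2),
  total_vaccinations = c(10, NA, 30, NA, 5, NA)
)

toy_filled <- toy %>%
  arrange(location, date) %>%
  group_by(location) %>%
  fill(total_vaccinations) %>%   # default direction is "down"
  ungroup()

toy_filled$total_vaccinations    # 10 10 30 NA 5 5
```

The first day for "Y" stays NA because there is no earlier value within that location to copy down.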
We now have totals for every country once they report their first figures, but before that the values will be NA. Again, we won’t make any assumptions about any vaccinations before the first report, so we’ll set these values to 0.
complete_vacc_data[is.na(complete_vacc_data)] <- 0
complete_vacc_data
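The assignment above looks at every column of the data frame. If there is any concern that a non-numeric column could still contain NA, a more conservative variant restricts the replacement to numeric columns. The sketch below uses base R on a made-up data frame; `toy` and `num_cols` are illustrative names.

```r
# Made-up data with an NA in both a character and a numeric column.
toy <- data.frame(
  location = c("X", "Y"),
  iso_code = c("XXX", NA),
  total_vaccinations = c(NA, 5),
  stringsAsFactors = FALSE
)

# Identify the numeric columns, then replace NA with 0 in those columns only.
num_cols <- vapply(toy, is.numeric, logical(1))
toy[num_cols] <- lapply(toy[num_cols], function(x) replace(x, is.na(x), 0))

toy
```

Here the missing total becomes 0, while the missing iso_code is left as NA so that it remains visible as a data problem.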
Some of our records are not for sovereign nations, which means they don’t have valid three-letter ISO codes. We’ll store that data separately before removing it from our main data frame:
non_nation_data <- complete_vacc_data %>%
  filter(nchar(iso_code) != 3)
non_nation_data
non_nation_data %>% distinct(iso_code)
continent_vacc_data <- non_nation_data %>%
filter(location %in% c('Africa', 'Asia', 'Europe', 'North America', 'Oceania', 'South America'))
continent_vacc_data
national_vacc_data <- complete_vacc_data %>%
filter(nchar(iso_code) == 3)
national_vacc_data
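The split relies only on the length of the code. A minimal base R sketch on made-up rows illustrates it; the `OWID_`-prefixed codes are an assumption about how aggregate rows are labelled, so check the output of distinct(iso_code) above for the actual values.

```r
# Made-up rows: one country and two aggregates (codes assumed for illustration).
toy <- data.frame(
  location = c("Afghanistan", "Africa", "World"),
  iso_code = c("AFG", "OWID_AFR", "OWID_WRL"),
  stringsAsFactors = FALSE
)

# Three-character codes are treated as national records; anything else is not.
is_national  <- nchar(toy$iso_code) == 3
toy_national <- toy[is_national, ]
toy_other    <- toy[!is_national, ]

toy_national$location
toy_other$location
```

Because the two subsets use complementary conditions, every row ends up in exactly one of them.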
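We now have 3 data frames that we may want to use for reporting and visualising our data:
complete_vacc_data: every location, with a record for every date
national_vacc_data: only the locations with a three-letter ISO code
continent_vacc_data: only the six continents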
We will complete the data cleaning process by converting the 2 string columns to factors. We’ll do this for all 3 data frames:
complete_vacc_data %<>%
mutate(location = as.factor(location)) %>%
mutate(iso_code = as.factor(iso_code))
national_vacc_data %<>%
mutate(location = as.factor(location)) %>%
mutate(iso_code = as.factor(iso_code))
continent_vacc_data %<>%
mutate(location = as.factor(location)) %>%
mutate(iso_code = as.factor(iso_code))
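Since the same two conversions are repeated for each data frame, one option is to wrap them in a small function and apply it to a list. This is only a sketch in base R on made-up data; `to_factor_ids()` and `toy_frames` are hypothetical names, not part of the original analysis.

```r
# Hypothetical helper: convert both identifier columns of one data frame to factors.
to_factor_ids <- function(df) {
  df$location <- as.factor(df$location)
  df$iso_code <- as.factor(df$iso_code)
  df
}

# Made-up stand-ins for the three data frames.
toy_frames <- list(
  complete = data.frame(location = c("X", "Europe"), iso_code = c("XXX", "OWID_EUR"),
                        stringsAsFactors = FALSE),
  national = data.frame(location = "X", iso_code = "XXX", stringsAsFactors = FALSE)
)

# Apply the same conversion to every data frame in the list.
toy_frames <- lapply(toy_frames, to_factor_ids)
str(toy_frames$complete)
```

The explicit version above is perfectly fine for three data frames; a helper mainly pays off if more columns or more data frames are added later.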
Further Information: a description of the Our World in Data COVID-19 vaccination data is available at https://github.com/owid/covid-19-data/tree/master/public/data/vaccinations